UK Electricity Demand Forecasting¶

This notebook presents a comprehensive study of electricity demand forecasting for the United Kingdom (UK) using historical data. The goal of this project is to develop a robust and accurate model that can predict future electricity demand patterns, enabling better planning and resource allocation.

In [ ]:
# Import general packages
import os
import pandas as pd
import numpy as np
from numpy import ndarray
from datetime import datetime, timedelta
import seaborn as sns
# Local package created to provide clean code
from electricity_forecast.plot_data import interactive_chart, static_chart, distribution_plot, plot_variables, \
    plot_seasonal_day_week, plot_seasonal_month_year, plot_features_demand
from electricity_forecast.data_preprocessing import transform_cyclical_features, tranform_label_features

Data understanding¶

Data collection¶

The dataset provided by the UK National Grid operator includes observations of electricity demand (in megawatts) measured in each half-hour of a day from January 2009 until April 2023. Additionally, weather data was obtained from Visual Crossing weather data services, from January 2009 until December 2023, which includes average daily temperature, humidity, windspeed, and other weather variables. The dataset consists of two separate files: one for electricity demand data and the other for weather data.

The following code reads data from the files and verifies that it satisfies the problem description and requirements.

  • Data is measured twice every hour. Each day contains 48 periods of measurement (2 periods/hour * 24 hours/day = 48 periods/day).
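As a quick sanity check on that cadence, a half-hourly date range covering a single day should contain exactly 48 timestamps (this snippet is a standalone sketch, independent of the project's `DataLoader`):

```python
import pandas as pd

# One calendar day sampled every 30 minutes should yield 48 settlement periods
one_day = pd.date_range('2009-01-01 00:00', '2009-01-01 23:30', freq='30min')
print(len(one_day))  # 48
```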
In [ ]:
from electricity_forecast.data_loader import DataLoader

# Load and preprocess the data
print(os.getcwd() + '/data/')
data_loader = DataLoader(os.getcwd() + '/data/')
df_energy, df_weather = data_loader.load_data('original')

display(df_energy.info())
display(df_weather.info())
display(df_energy.head(5))
display(df_weather.head(5))
d:\projects\Forecasting.ElectricityDemand.UK/data/
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 250942 entries, 0 to 250941
Data columns (total 21 columns):
 #   Column                     Non-Null Count   Dtype  
---  ------                     --------------   -----  
 0   Unnamed: 0                 250942 non-null  int64  
 1   settlement_date            250942 non-null  object 
 2   settlement_period          250942 non-null  int64  
 3   nd                         250942 non-null  int64  
 4   tsd                        250942 non-null  int64  
 5   england_wales_demand       250942 non-null  int64  
 6   embedded_wind_generation   250942 non-null  int64  
 7   embedded_wind_capacity     250942 non-null  int64  
 8   embedded_solar_generation  250942 non-null  int64  
 9   embedded_solar_capacity    250942 non-null  int64  
 10  non_bm_stor                250942 non-null  int64  
 11  pump_storage_pumping       250942 non-null  int64  
 12  ifa_flow                   250942 non-null  int64  
 13  ifa2_flow                  250942 non-null  int64  
 14  britned_flow               250942 non-null  int64  
 15  moyle_flow                 250942 non-null  int64  
 16  east_west_flow             250942 non-null  int64  
 17  nemo_flow                  250942 non-null  int64  
 18  nsl_flow                   75646 non-null   float64
 19  eleclink_flow              75646 non-null   float64
 20  is_holiday                 250942 non-null  int64  
dtypes: float64(2), int64(18), object(1)
memory usage: 40.2+ MB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5478 entries, 0 to 5477
Data columns (total 33 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   name              5478 non-null   object 
 1   datetime          5478 non-null   object 
 2   tempmax           5478 non-null   float64
 3   tempmin           5478 non-null   float64
 4   temp              5478 non-null   float64
 5   feelslikemax      5478 non-null   float64
 6   feelslikemin      5478 non-null   float64
 7   feelslike         5478 non-null   float64
 8   dew               5478 non-null   float64
 9   humidity          5478 non-null   float64
 10  precip            5478 non-null   float64
 11  precipprob        5478 non-null   float64
 12  precipcover       5478 non-null   float64
 13  preciptype        2702 non-null   object 
 14  snow              5336 non-null   float64
 15  snowdepth         5385 non-null   float64
 16  windgust          2548 non-null   float64
 17  windspeed         5478 non-null   float64
 18  winddir           5478 non-null   float64
 19  sealevelpressure  5194 non-null   float64
 20  cloudcover        5478 non-null   float64
 21  visibility        5478 non-null   float64
 22  solarradiation    4971 non-null   float64
 23  solarenergy       4971 non-null   float64
 24  uvindex           4971 non-null   float64
 25  severerisk        579 non-null    float64
 26  sunrise           5478 non-null   object 
 27  sunset            5478 non-null   object 
 28  moonphase         5478 non-null   float64
 29  conditions        5478 non-null   object 
 30  description       5336 non-null   object 
 31  icon              5478 non-null   object 
 32  stations          5322 non-null   object 
dtypes: float64(24), object(9)
memory usage: 1.4+ MB
None
Unnamed: 0 settlement_date settlement_period nd tsd england_wales_demand embedded_wind_generation embedded_wind_capacity embedded_solar_generation embedded_solar_capacity ... pump_storage_pumping ifa_flow ifa2_flow britned_flow moyle_flow east_west_flow nemo_flow nsl_flow eleclink_flow is_holiday
0 0 2009-01-01 1 37910 38704 33939 54 1403 0 0 ... 33 2002 0 0 -161 0 0 NaN NaN 1
1 1 2009-01-01 2 38047 38964 34072 53 1403 0 0 ... 157 2002 0 0 -160 0 0 NaN NaN 1
2 2 2009-01-01 3 37380 38651 33615 53 1403 0 0 ... 511 2002 0 0 -160 0 0 NaN NaN 1
3 3 2009-01-01 4 36426 37775 32526 50 1403 0 0 ... 589 1772 0 0 -160 0 0 NaN NaN 1
4 4 2009-01-01 5 35687 37298 31877 50 1403 0 0 ... 851 1753 0 0 -160 0 0 NaN NaN 1

5 rows × 21 columns

name datetime tempmax tempmin temp feelslikemax feelslikemin feelslike dew humidity ... solarenergy uvindex severerisk sunrise sunset moonphase conditions description icon stations
0 United Kingdom 2009-01-01 36.9 32.0 34.2 33.0 26.7 30.2 29.9 84.0 ... NaN NaN NaN 2009-01-01T08:06:15 2009-01-01T16:02:16 0.17 Overcast Cloudy skies throughout the day. cloudy 03769099999,03660099999,03672099999,0378109999...
1 United Kingdom 2009-01-02 40.3 30.9 35.7 34.3 25.5 30.8 31.0 83.4 ... NaN NaN NaN 2009-01-02T08:06:07 2009-01-02T16:03:21 0.20 Snow, Rain, Partially cloudy Partly cloudy throughout the day with morning ... rain 03769099999,03660099999,03672099999,0378109999...
2 United Kingdom 2009-01-03 37.5 23.2 28.9 36.0 19.3 26.6 22.7 78.3 ... NaN NaN NaN 2009-01-03T08:05:55 2009-01-03T16:04:29 0.24 Clear Clear conditions throughout the day. clear-day 03769099999,03660099999,03672099999,0378109999...
3 United Kingdom 2009-01-04 32.2 21.6 27.3 27.4 16.0 23.6 23.8 87.0 ... NaN NaN NaN 2009-01-04T08:05:40 2009-01-04T16:05:39 0.25 Partially cloudy Partly cloudy throughout the day. partly-cloudy-day 03769099999,03660099999,03672099999,0378109999...
4 United Kingdom 2009-01-05 35.3 28.4 31.9 27.1 19.8 24.4 26.9 82.3 ... NaN NaN NaN 2009-01-05T08:05:22 2009-01-05T16:06:52 0.31 Snow, Rain, Partially cloudy Partly cloudy throughout the day with afternoo... rain 03769099999,03660099999,03672099999,0378109999...

5 rows × 33 columns

Data format: The electricity demand data requires removing the first (unnamed index) column and adding a time component to the date for further visual analysis. The weather data is heterogeneous and requires adding a time component to match the electricity time format. Both datasets were formatted to share a common datetime column name. Also, a subset of weather variables was selected to simplify the analysis.
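The half-hourly timestamp can be reconstructed from the settlement date and period. A minimal sketch, assuming settlement period 1 maps to 00:00 (the project's formatting helpers may differ in details):

```python
import pandas as pd

# Toy frame mimicking the raw electricity file (settlement_date + settlement_period)
df = pd.DataFrame({
    'settlement_date': ['2009-01-01', '2009-01-01', '2009-01-01'],
    'settlement_period': [1, 2, 3],
})
# Period 1 -> 00:00, period 2 -> 00:30, ...: shift by (period - 1) half-hours
df['datetime'] = pd.to_datetime(df['settlement_date']) + \
    pd.to_timedelta((df['settlement_period'] - 1) * 30, unit='min')
print(df['datetime'].tolist())
```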

In [ ]:
data_loader.format_electricity_data()
data_loader.format_weather_data()
In [ ]:
# Load and preprocess the data
print(os.getcwd() + '/data/')
data_loader = DataLoader(os.getcwd() + '/data/')
df_energy, df_weather, df_datetime = data_loader.load_data('formatted')

display(df_energy.info())
display(df_weather.info())
display(df_energy.head(5))
display(df_weather.head(5))
d:\projects\Forecasting.ElectricityDemand.UK/data/
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 250944 entries, 0 to 250943
Data columns (total 18 columns):
 #   Column                     Non-Null Count   Dtype         
---  ------                     --------------   -----         
 0   datetime                   250944 non-null  datetime64[ns]
 1   nd                         250914 non-null  float64       
 2   tsd                        250914 non-null  float64       
 3   england_wales_demand       250914 non-null  float64       
 4   embedded_wind_generation   250914 non-null  float64       
 5   embedded_wind_capacity     250914 non-null  float64       
 6   embedded_solar_generation  250914 non-null  float64       
 7   embedded_solar_capacity    250914 non-null  float64       
 8   non_bm_stor                250914 non-null  float64       
 9   pump_storage_pumping       250914 non-null  float64       
 10  ifa_flow                   250914 non-null  float64       
 11  ifa2_flow                  250914 non-null  float64       
 12  britned_flow               250914 non-null  float64       
 13  moyle_flow                 250914 non-null  float64       
 14  east_west_flow             250914 non-null  float64       
 15  nemo_flow                  250914 non-null  float64       
 16  nsl_flow                   75638 non-null   float64       
 17  eleclink_flow              75638 non-null   float64       
dtypes: datetime64[ns](1), float64(17)
memory usage: 34.5 MB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 262897 entries, 0 to 262896
Data columns (total 9 columns):
 #   Column        Non-Null Count   Dtype         
---  ------        --------------   -----         
 0   datetime      262897 non-null  datetime64[ns]
 1   tempmax       262897 non-null  float64       
 2   tempmin       262897 non-null  float64       
 3   temp          262897 non-null  float64       
 4   feelslikemax  262897 non-null  float64       
 5   feelslikemin  262897 non-null  float64       
 6   feelslike     262897 non-null  float64       
 7   humidity      262897 non-null  float64       
 8   windspeed     262897 non-null  float64       
dtypes: datetime64[ns](1), float64(8)
memory usage: 18.1 MB
None
datetime nd tsd england_wales_demand embedded_wind_generation embedded_wind_capacity embedded_solar_generation embedded_solar_capacity non_bm_stor pump_storage_pumping ifa_flow ifa2_flow britned_flow moyle_flow east_west_flow nemo_flow nsl_flow eleclink_flow
0 2009-01-01 00:00:00 37910.0 38704.0 33939.0 54.0 1403.0 0.0 0.0 0.0 33.0 2002.0 0.0 0.0 -161.0 0.0 0.0 NaN NaN
1 2009-01-01 00:30:00 38047.0 38964.0 34072.0 53.0 1403.0 0.0 0.0 0.0 157.0 2002.0 0.0 0.0 -160.0 0.0 0.0 NaN NaN
2 2009-01-01 01:00:00 37380.0 38651.0 33615.0 53.0 1403.0 0.0 0.0 0.0 511.0 2002.0 0.0 0.0 -160.0 0.0 0.0 NaN NaN
3 2009-01-01 01:30:00 36426.0 37775.0 32526.0 50.0 1403.0 0.0 0.0 0.0 589.0 1772.0 0.0 0.0 -160.0 0.0 0.0 NaN NaN
4 2009-01-01 02:00:00 35687.0 37298.0 31877.0 50.0 1403.0 0.0 0.0 0.0 851.0 1753.0 0.0 0.0 -160.0 0.0 0.0 NaN NaN
datetime tempmax tempmin temp feelslikemax feelslikemin feelslike humidity windspeed
0 2009-01-01 00:00:00 36.9 32.0 34.2 33.0 26.7 30.2 84.0 5.9
1 2009-01-01 00:30:00 36.9 32.0 34.2 33.0 26.7 30.2 84.0 5.9
2 2009-01-01 01:00:00 36.9 32.0 34.2 33.0 26.7 30.2 84.0 5.9
3 2009-01-01 01:30:00 36.9 32.0 34.2 33.0 26.7 30.2 84.0 5.9
4 2009-01-01 02:00:00 36.9 32.0 34.2 33.0 26.7 30.2 84.0 5.9

Partial conclusion: All data fits the requirements stated in the problem. The datetime column in each file contains both date and time information, which helps with visualization of the data.

Next step is to explore the data to get some insights that could help understand how to forecast electricity demand of the UK with this data.

Exploratory data analysis (EDA)¶

Perform exploratory data analysis (EDA) to understand the distribution, relationships, and patterns in the data. Visualizations can help identify insights and guide feature engineering.

In [ ]:
# Print a summary of the dataset
print("Dataset Summary:")
display(df_energy.describe())
display(df_weather.describe())
Dataset Summary:
datetime nd tsd england_wales_demand embedded_wind_generation embedded_wind_capacity embedded_solar_generation embedded_solar_capacity non_bm_stor pump_storage_pumping ifa_flow ifa2_flow britned_flow moyle_flow east_west_flow nemo_flow nsl_flow eleclink_flow
count 250944 250914.000000 250914.000000 250914.000000 250914.000000 250914.000000 250914.000000 250914.000000 250914.000000 250914.000000 250914.000000 250914.00000 250914.000000 250914.000000 250914.000000 250914.000000 75638.000000 75638.000000
mean 2016-02-27 23:45:00.000001024 31828.702882 33215.576309 28954.201954 1212.054525 4210.098966 796.052588 7755.611014 7.406059 320.662071 918.987211 11.42723 542.740078 -107.847824 -25.244988 150.901923 184.870753 -54.207184
min 2009-01-01 00:00:00 13367.000000 0.000000 0.000000 0.000000 1403.000000 0.000000 0.000000 -24.000000 0.000000 -2056.000000 -1030.00000 -1215.000000 -505.000000 -585.000000 -1022.000000 -1455.000000 -1028.000000
25% 2012-07-30 23:52:30 25619.000000 27238.000000 23297.000000 520.000000 2085.000000 0.000000 1819.000000 0.000000 8.000000 208.250000 0.00000 0.000000 -251.000000 -127.000000 0.000000 0.000000 0.000000
50% 2016-02-27 23:45:00 31266.000000 32463.000000 28422.000000 971.000000 4152.000000 0.000000 9300.000000 0.000000 12.000000 1246.000000 0.00000 766.000000 -120.000000 0.000000 0.000000 0.000000 0.000000
75% 2019-09-26 23:37:30 37537.000000 38690.000000 34174.000000 1648.000000 6192.000000 730.000000 13080.000000 0.000000 466.000000 1899.000000 0.00000 994.000000 45.000000 5.000000 0.000000 366.000000 0.000000
max 2023-04-25 23:30:00 59095.000000 60147.000000 53325.000000 5354.000000 6574.000000 9830.000000 13861.000000 893.000000 2019.000000 2066.000000 1016.00000 1143.000000 499.000000 504.000000 1033.000000 1401.000000 1002.000000
std NaN 7768.776368 7697.950323 7043.248055 925.835390 1925.165504 1594.304883 5484.671920 41.235912 544.239312 1092.471718 307.62554 507.435010 223.073430 250.977542 387.718689 489.713956 362.135756
datetime tempmax tempmin temp feelslikemax feelslikemin feelslike humidity windspeed
count 262897 262897.000000 262897.000000 262897.000000 262897.000000 262897.000000 262897.000000 262897.000000 262897.000000
mean 2016-07-01 12:00:00.000000768 59.935904 47.646589 53.623939 57.563539 44.124298 52.007846 75.021550 13.102991
min 2009-01-01 00:00:00 29.900000 19.100000 26.000000 0.000000 0.000000 16.700000 36.000000 2.800000
25% 2012-10-01 06:00:00 51.200000 41.100000 46.400000 50.600000 35.800000 42.600000 67.500000 9.900000
50% 2016-07-01 12:00:00 59.600000 47.900000 53.500000 59.100000 44.500000 53.000000 75.900000 12.400000
75% 2020-03-31 18:00:00 68.300000 54.800000 61.200000 68.300000 54.600000 61.300000 82.900000 15.700000
max 2023-12-31 00:00:00 103.700000 71.800000 86.500000 100.700000 71.800000 90.900000 98.900000 39.500000
std NaN 11.472219 9.017741 9.805832 15.667978 13.092974 11.913379 10.440113 4.679502
In [ ]:
# Plot electricity demand
# interactive_chart(df_energy, start_date=datetime(2009, 1, 1), end_date=datetime(2023, 4, 30)) # High computational load
static_chart(df_energy, start_date=datetime(2009, 1, 1), end_date=datetime(2023, 4, 30))

Seasonal info: The electricity demand chart shows seasonality, so it will be helpful to decompose the timestamp into features related to this variable. Thus, several datetime features are extracted, such as year, month, and day of year. Distribution plots, box plots, and violin plots are presented to support the analysis.
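This kind of datetime decomposition can be sketched with pandas' `.dt` accessor (a standalone illustration; the project's `extract_time_features` may differ in details):

```python
import pandas as pd

dt = pd.Series(pd.date_range('2009-01-01', periods=4, freq='30min'))
features = pd.DataFrame({
    'date_year': dt.dt.year,
    'date_month': dt.dt.month,
    'date_hour': dt.dt.hour,
    'date_minute': dt.dt.minute,
    'date_dayofyear': dt.dt.dayofyear,
    'date_weekday': dt.dt.weekday,          # Monday=0 ... Sunday=6
    'date_isweekend': dt.dt.weekday >= 5,   # Saturday or Sunday
})
print(features.head())
```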

In [ ]:
# Create date and time features to analyze seasonality
data_loader.extract_time_features()
df_energy, df_weather, df_datetime = data_loader.load_data('formatted')
display(df_datetime.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 250944 entries, 0 to 250943
Data columns (total 14 columns):
 #   Column          Non-Null Count   Dtype         
---  ------          --------------   -----         
 0   datetime        250944 non-null  datetime64[ns]
 1   date_period     250944 non-null  float64       
 2   date_isholiday  250944 non-null  bool          
 3   date_year       250944 non-null  int64         
 4   date_month      250944 non-null  int64         
 5   date_day        250944 non-null  int64         
 6   date_hour       250944 non-null  int64         
 7   date_minute     250944 non-null  int64         
 8   date_dayofyear  250944 non-null  int64         
 9   date_weekday    250944 non-null  int64         
 10  date_quarter    250944 non-null  int64         
 11  date_week       250944 non-null  int64         
 12  date_isweekend  250944 non-null  bool          
 13  date_daypart    250944 non-null  object        
dtypes: bool(2), datetime64[ns](1), float64(1), int64(9), object(1)
memory usage: 23.5+ MB
None
In [ ]:
# Plot electricity demand variables
plot_variables(df_energy, df_energy.columns[1:], start_date=datetime(2009, 1, 1), end_date=datetime(2023, 4, 30))

Based on the previous chart, only the tsd column is considered significant for this task, as it represents the target variable (electricity demand). The remaining variables cannot be used as prediction features because they are measured at the same time as the target and would not be available as forecast inputs.

The next code will explore the influence of time features on the demand.

In [ ]:
# Analyze seasonal components from day and week
df_energy_time = pd.merge(df_datetime, df_energy[['datetime', 'tsd']], on='datetime', how='left')
plot_seasonal_day_week(df_energy_time)

Hourly and Daily Seasonal analysis

There is a clear pattern of electricity demand that depends on the period of the day: demand is higher during the afternoon and evening, and lower at night and in the morning. Demand is similar across weekdays but differs between weekdays and weekends. A likely reason for this pattern is reduced business electricity consumption during weekends.

The weekend vs holiday chart indicates that on weekends the electricity demand is similar between holidays and non-holidays. However, on weekdays there is a difference between holidays and non-holidays, and holiday values resemble weekend values. This suggests that weekends and holidays may have the same effect on electricity demand, which would be worth exploring in future analysis.

The hour and half-hour period charts show a cyclical pattern, meaning the demand at the beginning of the cycle is similar to the demand at the end. For example, the demand at 11 pm is similar to the demand at midnight. The other features do not show the same pattern.

Also, minutes hold no relevance to the demand. The half-hour periods can be seen as a more detailed view of the hourly demand.

Let's analyze now the seasonal month and year components.

In [ ]:
# Analyze seasonal components from month and year
plot_seasonal_month_year(df_energy_time)

Monthly and Yearly Seasonal analysis

The day of the month chart indicates that this feature does not significantly influence electricity demand. As a result, it could be excluded from the forecast model.

On the other hand, the day of the year, week, month, and quarter exhibit a significant pattern of lower demand during summer and higher demand during winter. This behavior is attributed to the increased energy consumption for heating buildings during the colder months, particularly in the UK where winter temperatures can drop below 20 degrees Fahrenheit (-6.67 degrees Celsius). These features also demonstrate a cyclical pattern, meaning that the demand at the beginning of the year is similar to that at the end. For instance, the demand in January resembles the demand in December. An interesting observation is that these features, presented in the given order, show the same cyclical demand pattern but at progressively more detailed levels, from broader time scales (year) to more granular ones (quarter, month, week).

The year component shows a decreasing demand trend, which is consistent with the reported reduction in UK electricity consumption.

Partial conclusions: All time features at and above the hour level have some degree of impact on the demand, aligning with the time-series approach adopted to address this problem. Hence, these features should be taken into account when constructing the forecasting model.

The next step is to handle missing values and outliers. Thus, the distribution of the demand is plotted.

In [ ]:
# Plot electricity demand distribution
distribution_plot(df_energy)
In [ ]:
# Detect missing data
datetime_nan = df_energy.loc[df_energy['tsd'].isna(), 'datetime']
print('Timestamps with nan values:')
display(datetime_nan)
datetime_zero = df_energy.loc[df_energy['tsd'] == 0, 'datetime']
print('Timestamps with zero values:')
display(datetime_zero)
Timestamps with nan values:
4222     2009-03-29 23:00:00
4223     2009-03-29 23:30:00
21694    2010-03-28 23:00:00
21695    2010-03-28 23:30:00
39166    2011-03-27 23:00:00
39167    2011-03-27 23:30:00
56638    2012-03-25 23:00:00
56639    2012-03-25 23:30:00
74446    2013-03-31 23:00:00
74447    2013-03-31 23:30:00
91918    2014-03-30 23:00:00
91919    2014-03-30 23:30:00
109390   2015-03-29 23:00:00
109391   2015-03-29 23:30:00
126862   2016-03-27 23:00:00
126863   2016-03-27 23:30:00
144334   2017-03-26 23:00:00
144335   2017-03-26 23:30:00
161806   2018-03-25 23:00:00
161807   2018-03-25 23:30:00
179614   2019-03-31 23:00:00
179615   2019-03-31 23:30:00
197086   2020-03-29 23:00:00
197087   2020-03-29 23:30:00
214558   2021-03-28 23:00:00
214559   2021-03-28 23:30:00
232030   2022-03-27 23:00:00
232031   2022-03-27 23:30:00
249502   2023-03-26 23:00:00
249503   2023-03-26 23:30:00
Name: datetime, dtype: datetime64[ns]
Timestamps with zero values:
11522   2009-08-29 01:00:00
11523   2009-08-29 01:30:00
11524   2009-08-29 02:00:00
11525   2009-08-29 02:30:00
11526   2009-08-29 03:00:00
                ...        
60380   2012-06-11 22:00:00
60381   2012-06-11 22:30:00
60382   2012-06-11 23:00:00
60383   2012-06-11 23:30:00
66337   2012-10-14 00:30:00
Name: datetime, Length: 479, dtype: datetime64[ns]

Missing Data and Outliers

Some measurements are missing from the last two periods of one of the last days of March every year. These dates coincide with the clocks moving forward for British Summer Time, which leaves those days with fewer settlement periods, so the gap is likely an artifact of the data acquisition system rather than random loss. To address this issue, the missing data could be replaced based on the demand distribution for each period. Additionally, measurements equal to zero are considered missing values or outliers and appear to cover entire days; these zero measurements can also be replaced.

However, the initial approach is to remove observations with missing electricity demand values. In further analysis, alternatives include removing the days that contain any missing periods, or applying imputation strategies to fill in the missing values.
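The removal strategy can be sketched as a boolean mask over the demand column (a simplified stand-in for `handle_missing_data('remove_time')`, whose exact behavior lives in the local package):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'datetime': pd.date_range('2009-01-01', periods=4, freq='30min'),
    'tsd': [37910.0, np.nan, 0.0, 36426.0],
})
# Treat NaN and zero demand as missing and drop those observations
mask = df['tsd'].notna() & (df['tsd'] != 0)
df_clean = df.loc[mask]
print(len(df_clean))  # 2
```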

In [ ]:
# Handling missing data
data_loader.handle_missing_data('remove_time')
df_energy, df_weather, df_datetime = data_loader.load_data('no_missing')

Weather data

The next step is to analyze weather patterns and their relationship with the electricity demand.

In [ ]:
# Plot weather variables over time
plot_variables(df_weather, df_weather.columns[1:], start_date=datetime(2009, 1, 1), end_date=datetime(2023, 4, 30))

It appears that weather data also has seasonal components. Let's plot the electricity demand against weather features to get some more insights.

In [ ]:
# Plot electricity demand vs weather features
df_energy_weather = pd.merge(df_energy[['datetime', 'tsd']], df_weather, on='datetime', how='left')
plot_features_demand(df_energy_weather, df_weather.columns[1:])

As assumed at the beginning, weather data is related to electricity demand. Temperature appears to be the variable with the most influence on demand. Therefore, all weather features should be included in the forecasting model.
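The temperature-demand relationship can be quantified with a simple correlation; the toy numbers below reflect the observed pattern (colder days, higher demand), while on real data one would compute `df_energy_weather['tsd'].corr(df_energy_weather['temp'])`:

```python
import pandas as pd

# Toy values: demand (MW) falls as temperature (degrees Fahrenheit) rises
toy = pd.DataFrame({
    'temp': [34.0, 45.0, 60.0, 70.0],
    'tsd':  [42000.0, 38000.0, 31000.0, 28000.0],
})
corr = toy['tsd'].corr(toy['temp'])
print(round(corr, 3))  # strongly negative
```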

Data transformation¶

In the process of building the forecast model, data transformation is necessary to create features that can be utilized by the models. Additionally, some new features may be generated. While each model has specific data requirements, there is a common stage of data transformation that is generally applied. This stage involves the following steps:

  • Adding new features by transforming cyclical time features, such as half-hour periods and day of the year.
  • Normalizing features to ensure consistency and uniformity in their scales.
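The sine/cosine transform maps a cyclical feature onto the unit circle, so the end of a cycle sits next to its beginning (e.g. hour 23 next to hour 0). A minimal sketch, assuming the period length is known for each feature (the project's `transform_cyclical_features` may differ in details):

```python
import numpy as np
import pandas as pd

def encode_cyclical(values: pd.Series, period: int) -> pd.DataFrame:
    # Project the feature onto the unit circle: one full cycle per period
    radians = 2 * np.pi * values / period
    return pd.DataFrame({'sin': np.sin(radians), 'cos': np.cos(radians)})

hours = pd.Series([0, 6, 12, 18, 23])
encoded = encode_cyclical(hours, period=24)
# Hour 23 ends up close to hour 0 on the circle, unlike the raw values
print(encoded.round(3))
```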
In [ ]:
# Verify time file format
display(df_datetime.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 250944 entries, 0 to 250943
Data columns (total 14 columns):
 #   Column          Non-Null Count   Dtype         
---  ------          --------------   -----         
 0   datetime        250944 non-null  datetime64[ns]
 1   date_period     250944 non-null  float64       
 2   date_isholiday  250944 non-null  bool          
 3   date_year       250944 non-null  int64         
 4   date_month      250944 non-null  int64         
 5   date_day        250944 non-null  int64         
 6   date_hour       250944 non-null  int64         
 7   date_minute     250944 non-null  int64         
 8   date_dayofyear  250944 non-null  int64         
 9   date_weekday    250944 non-null  int64         
 10  date_quarter    250944 non-null  int64         
 11  date_week       250944 non-null  int64         
 12  date_isweekend  250944 non-null  bool          
 13  date_daypart    250944 non-null  object        
dtypes: bool(2), datetime64[ns](1), float64(1), int64(9), object(1)
memory usage: 23.5+ MB
None
In [ ]:
# Create cyclical features
cyclical_features = ['date_period', 'date_month', 'date_hour', 'date_dayofyear', 'date_quarter', 'date_week']
df_cyclical = transform_cyclical_features(df_datetime, cyclical_features)
df_cyclical['datetime'] = df_datetime['datetime']
df_cyclical.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 250944 entries, 0 to 250943
Data columns (total 13 columns):
 #   Column              Non-Null Count   Dtype         
---  ------              --------------   -----         
 0   date_period_sin     250944 non-null  float64       
 1   date_period_cos     250944 non-null  float64       
 2   date_month_sin      250944 non-null  float64       
 3   date_month_cos      250944 non-null  float64       
 4   date_hour_sin       250944 non-null  float64       
 5   date_hour_cos       250944 non-null  float64       
 6   date_dayofyear_sin  250944 non-null  float64       
 7   date_dayofyear_cos  250944 non-null  float64       
 8   date_quarter_sin    250944 non-null  float64       
 9   date_quarter_cos    250944 non-null  float64       
 10  date_week_sin       250944 non-null  float64       
 11  date_week_cos       250944 non-null  float64       
 12  datetime            250944 non-null  datetime64[ns]
dtypes: datetime64[ns](1), float64(12)
memory usage: 24.9 MB
In [ ]:
# Plot cyclical features
df_energy_cyclical = pd.merge(df_energy[['datetime', 'tsd']], df_cyclical, on='datetime', how='left')
plot_features_demand(df_energy_cyclical, df_cyclical.columns[:-1])

Cyclical features: From the cyclical features chart, it can be assumed that not all cyclical features are representative when transformed with sine and cosine. It can be expected that date_dayofyear_cos and date_week_cos will contribute the most information to the forecast.

Considering these charts, all time features, both original and transformed, will be included in the model.

The next step is to transform boolean and string features into integers.
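Label encoding of a string feature such as date_daypart can be sketched with pandas (an illustrative stand-in for the local `tranform_label_features` helper, which presumably does something equivalent):

```python
import pandas as pd

dayparts = pd.Series(['night', 'morning', 'afternoon', 'evening', 'night'])
# factorize assigns a distinct integer code to each unique label, in order of appearance
codes, uniques = pd.factorize(dayparts)
print(codes.tolist())    # [0, 1, 2, 3, 0]
print(uniques.tolist())  # ['night', 'morning', 'afternoon', 'evening']
```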

In [ ]:
# Convert is weekend and is holiday features
df_datetime['date_isweekend'] = df_datetime['date_isweekend'].astype(int)
df_datetime['date_isholiday'] = df_datetime['date_isholiday'].astype(int)
In [ ]:
# Convert part of date to int
df_datetime['date_daypart'] = tranform_label_features(df_datetime, ['date_daypart'])
df_datetime.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 250944 entries, 0 to 250943
Data columns (total 14 columns):
 #   Column          Non-Null Count   Dtype         
---  ------          --------------   -----         
 0   datetime        250944 non-null  datetime64[ns]
 1   date_period     250944 non-null  float64       
 2   date_isholiday  250944 non-null  int32         
 3   date_year       250944 non-null  int64         
 4   date_month      250944 non-null  int64         
 5   date_day        250944 non-null  int64         
 6   date_hour       250944 non-null  int64         
 7   date_minute     250944 non-null  int64         
 8   date_dayofyear  250944 non-null  int64         
 9   date_weekday    250944 non-null  int64         
 10  date_quarter    250944 non-null  int64         
 11  date_week       250944 non-null  int64         
 12  date_isweekend  250944 non-null  int32         
 13  date_daypart    250944 non-null  int32         
dtypes: datetime64[ns](1), float64(1), int32(3), int64(9)
memory usage: 23.9 MB

Partial conclusions: The selected features for the forecast model include time features, cyclical features, and weather features. In the modeling step, a feature selection process will be executed to reduce the dimensionality of the feature space.

In [ ]:
# Save features dataframe
df_features = pd.merge(df_datetime, df_cyclical, on='datetime', how='left')
df_features = pd.merge(df_features, df_weather, on='datetime', how='left')
data_loader.save_data('features_2009_2023.csv', df_features)
In [ ]:
# Save demand dataframe
df_demand = df_energy[['datetime', 'tsd']]
data_loader.save_data('demand_2009_2023.csv', df_demand)

Model creation¶

In this section, the chosen models are fitted to predict electricity demand in order to identify the best-performing models. To facilitate this task, we begin by importing the necessary packages.

In [ ]:
# FbProphet model
from prophet import Prophet
from prophet.serialize import model_to_json, model_from_json

# XGB regression model
from xgboost import XGBRegressor

# Neural network model
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import RMSprop
from tensorflow_probability.python.layers import IndependentNormal
from keras.metrics import MeanAbsoluteError, MeanAbsolutePercentageError
from keras.callbacks import EarlyStopping

# For time series splitting
from sklearn.model_selection import TimeSeriesSplit
from electricity_forecast.plot_data import plot_timeseries_split

# For model evaluation
from sklearn.metrics import mean_absolute_percentage_error, mean_absolute_error, mean_squared_error

from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectFromModel

Model structures¶

In model creation, experiments were designed and executed with different model structures to find the best fit for the problem. These models are:

  • Generalized Additive Model with seasonality: Facebook Prophet (FbProphet)
  • Ensemble regression model: Extreme Gradient Boosting (XGB)
  • Neural network: Gaussian Multilayer Perceptron (GMLP)

FbProphet¶

In [ ]:
# Define the model
prophet_model = Prophet()
prophet_model.add_country_holidays(country_name='GB')
# Add weather features to the model
for feature in df_weather.columns[1:]:
    prophet_model.add_regressor(feature)

XGB¶

In [ ]:
# Define the model
xgb_model = XGBRegressor(n_estimators=500, subsample=0.7, max_depth=5)

GMLP¶

In [ ]:
# Define loss function for the network
# Negative log likelihood
def negloglik(y_true, y_pred):
    return -y_pred.log_prob(y_true)

# The model is defined before training
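For a univariate Gaussian output (what an `IndependentNormal` head produces), the negative log-likelihood evaluated by the loss above has the closed form 0.5·log(2πσ²) + (y − μ)²/(2σ²). A numeric sketch of that formula:

```python
import numpy as np

def gaussian_negloglik(y: float, mu: float, sigma: float) -> float:
    # Closed-form NLL of observation y under N(mu, sigma^2)
    return 0.5 * np.log(2 * np.pi * sigma**2) + (y - mu)**2 / (2 * sigma**2)

# The penalty grows quadratically as the predicted mean drifts from the target
print(gaussian_negloglik(y=30000.0, mu=30000.0, sigma=500.0))
print(gaussian_negloglik(y=30000.0, mu=31000.0, sigma=500.0))
```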

Model training¶

In this step, the training, validation, and test sets are created chronologically.

In [ ]:
# Split test set
test_split = datetime(2023, 1, 1, 0, 0)
X_test = df_features.loc[df_features['datetime'] >= test_split]
y_test = df_demand.loc[df_demand['datetime'] >= test_split]
In [ ]:
# Split train and validation set
validation_split = datetime(2022, 1, 1, 0, 0)
X_train = df_features.loc[df_features['datetime'] < validation_split]
y_train = df_demand.loc[df_demand['datetime'] < validation_split]
X_val = df_features.loc[(df_features['datetime'] >= validation_split) & (df_features['datetime'] < test_split)]
y_val = df_demand.loc[(df_demand['datetime'] >= validation_split) & (df_demand['datetime'] < test_split)]
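The boolean masks above partition the series chronologically with no overlap. A self-contained sketch with an illustrative monthly toy frame (standing in for `df_demand`) shows the same splitting logic:

```python
import pandas as pd
from datetime import datetime

# Toy frame with monthly timestamps, illustrative only
df = pd.DataFrame({'datetime': pd.date_range('2021-01-01', '2023-03-31', freq='MS'),
                   'tsd': range(27)})
test_split = datetime(2023, 1, 1)
validation_split = datetime(2022, 1, 1)

train = df[df['datetime'] < validation_split]
val = df[(df['datetime'] >= validation_split) & (df['datetime'] < test_split)]
test = df[df['datetime'] >= test_split]

# The three masks cover every row exactly once, in time order
assert len(train) + len(val) + len(test) == len(df)
assert train['datetime'].max() < val['datetime'].min() < test['datetime'].min()
```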
In [ ]:
# Cross validation split
n_years_test = 1
tss = TimeSeriesSplit(n_splits=5, test_size=48 * 365 * n_years_test, gap=48)
plot_timeseries_split(df_demand, tss, test_split)
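`TimeSeriesSplit` produces an expanding training window with a fixed-size test window, and `gap` excludes samples between them to avoid leakage. A small demonstration with illustrative sizes (12 samples instead of half-hourly years):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)
# Expanding training window, 2-sample test window, one-sample gap
tss_demo = TimeSeriesSplit(n_splits=3, test_size=2, gap=1)
for train_idx, test_idx in tss_demo.split(X):
    # Training data always precedes the test window; the gap separates them
    assert train_idx.max() < test_idx.min() - 1
    print(train_idx, test_idx)
```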
In [ ]:
# Set saving directory for models
model_save_dir = 'model/'
if not os.path.exists(model_save_dir):
    os.mkdir(model_save_dir)

Train FbProphet¶

In [ ]:
model_name = 'fbprophet'
model_file = model_save_dir + 'model_' + model_name + '.json'
if not os.path.exists(model_file):
    print('Training model')
    train_prophet = pd.merge(y_train, X_train[df_weather.columns], on='datetime', how='left')
    train_prophet.rename(columns={'datetime': 'ds', 'tsd': 'y'}, inplace=True)
    prophet_model.fit(train_prophet)
    print('Saving model')
    with open(model_file, 'w') as fout:
        fout.write(model_to_json(prophet_model))  # Save model
else:
    print('Loading model')
    with open(model_file, 'r') as fin:
        prophet_model = model_from_json(fin.read())  # Load model
Loading model
In [ ]:
print('Testing model')
val_prophet = pd.merge(y_val, X_val[df_weather.columns], on='datetime', how='left')
val_prophet.rename(columns={'datetime': 'ds', 'tsd': 'y'}, inplace=True)
# use the model to make a forecast
val_pred_prophet = prophet_model.predict(val_prophet)
# summarize the forecast
print(val_pred_prophet[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].head())
# plot forecast
prophet_model.plot(val_pred_prophet)
from matplotlib import pyplot as plt
# plt.scatter(test_prophet['ds'], test_prophet['y'])
plt.show()
Testing model
                   ds          yhat    yhat_lower    yhat_upper
0 2022-01-01 00:00:00  18543.137484  15310.722410  21636.183191
1 2022-01-01 00:30:00  17976.745986  14726.250332  21180.560981
2 2022-01-01 01:00:00  17529.476476  14422.087231  20626.766335
3 2022-01-01 01:30:00  17120.734880  13840.901348  20215.794961
4 2022-01-01 02:00:00  16695.922696  13583.588698  19723.230036
In [ ]:
print('Evaluate model')
mse = mean_squared_error(val_prophet['y'], val_pred_prophet['yhat'])
rmse = np.sqrt(mean_squared_error(val_prophet['y'], val_pred_prophet['yhat']))
mae = mean_absolute_error(val_prophet['y'], val_pred_prophet['yhat'])
mape = mean_absolute_percentage_error(val_prophet['y'], val_pred_prophet['yhat'])*100
print("MAE: %.2f" % (mse))
print("RMSE: %.2f" % (rmse))
print("MAE: %.2f" % (mae))
print("MAPE: %.2f%%" % (mape))
Evaluate model
MSE: 12090957.17
RMSE: 3477.21
MAE: 2636.98
MAPE: 9.63%
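The same four metrics are recomputed for every model in this notebook, so they could be factored into a small helper (a sketch, not part of the original notebook) that keeps the labels consistent:

```python
import numpy as np
from sklearn.metrics import (mean_absolute_error, mean_absolute_percentage_error,
                             mean_squared_error)

def report_metrics(y_true, y_pred):
    """Compute and print MSE, RMSE, MAE and MAPE for a forecast."""
    mse = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)
    mae = mean_absolute_error(y_true, y_pred)
    mape = mean_absolute_percentage_error(y_true, y_pred) * 100
    print("MSE: %.2f" % mse)
    print("RMSE: %.2f" % rmse)
    print("MAE: %.2f" % mae)
    print("MAPE: %.2f%%" % mape)
    return mse, rmse, mae, mape

# Hand-checkable toy values: errors of 10 on observations 100 and 200
mse, rmse, mae, mape = report_metrics([100.0, 200.0], [110.0, 190.0])
```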

Train XGB¶

In [ ]:
model_name = 'xgb'
model_file = model_save_dir + 'model_' + model_name + '.model'
train_xgb = pd.merge(y_train, X_train, on='datetime', how='left')
scaler = StandardScaler()
scaler.fit(train_xgb[X_train.columns[1:]])
if not os.path.exists(model_file):
    print('Training model')
    xgb_model = xgb_model.fit(scaler.transform(train_xgb[X_train.columns[1:]]), train_xgb[y_train.columns[1:]])
    print('Saving model')
    xgb_model.save_model(model_file)
else:
    print('Loading model')
    xgb_model.load_model(model_file)
Loading model
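Note that the scaler statistics are computed on the training data only and then reused to transform the validation and test sets, which avoids information leakage. A minimal illustration with toy values:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train_col = np.array([[1.0], [2.0], [3.0]])
val_col = np.array([[4.0]])

scaler_demo = StandardScaler().fit(train_col)  # statistics from training data only
z = scaler_demo.transform(val_col)             # reuse them on unseen data
# train mean = 2, population std = sqrt(2/3), so z = (4 - 2) / sqrt(2/3) = sqrt(6)
```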
In [ ]:
print('Testing model')
# use the model to make a forecast
val_xgb = pd.merge(y_val, X_val, on='datetime', how='left')
val_pred_xgb = xgb_model.predict(scaler.transform(val_xgb[X_val.columns[1:]]))
plot_count = 1440
# plot prediction
plt.figure(figsize=(15,5))
plt.plot(val_xgb['datetime'].iloc[:plot_count], val_pred_xgb[:plot_count])
plt.scatter(val_xgb['datetime'].iloc[:plot_count], val_pred_xgb[:plot_count])
plt.show()

print('Evaluate model')
mse = mean_squared_error(y_val['tsd'], val_pred_xgb)
rmse = np.sqrt(mean_squared_error(y_val['tsd'], val_pred_xgb))
mae = mean_absolute_error(y_val['tsd'], val_pred_xgb)
mape = mean_absolute_percentage_error(y_val['tsd'], val_pred_xgb)*100
print("MSE: %.2f" % (mse))
print("RMSE: %.2f" % (rmse))
print("MAE: %.2f" % (mae))
print("MAPE: %.2f%%" % (mape))
Testing model
Evaluate model
MSE: 4609979.45
RMSE: 2147.09
MAE: 1718.80
MAPE: 6.01%
In [ ]:
print('Feature importance')
feature_names = X_train.columns[1:]
importances = xgb_model.feature_importances_.reshape(-1,1)
model_importances = pd.DataFrame(importances, columns=['importance'], index=feature_names)
model_importances.sort_values(by='importance', ascending=False, inplace=True)
model_importances.to_csv(model_save_dir + 'features_' + model_name + '.csv')

fig, ax = plt.subplots()
model_importances['importance'].plot.bar(ax=ax)
ax.set_title("Feature importances using MDI")
ax.set_ylabel("Mean decrease in impurity")
fig.tight_layout()
Feature importance

Feature Selection: This model can be employed to select features for training other regression models. The code below accomplishes this task.

In [ ]:
print('Feature selection')
# Fit model using each importance as a threshold
thresholds = np.sort(xgb_model.feature_importances_)
print('Retraining model')
for thresh in thresholds:
    # select features using threshold
    selection = SelectFromModel(xgb_model, threshold=thresh, prefit=True)
    select_X_train = selection.transform(scaler.transform(train_xgb[X_train.columns[1:]]))
    # train model
    selection_model_file = model_save_dir + 'model_' + model_name + '_th_{:.5f}'.format(thresh) + '.model'
    selection_model = XGBRegressor(n_estimators=500, subsample=0.7, max_depth=5)
    if os.path.exists(selection_model_file):
        selection_model.load_model(selection_model_file)
    else:
        selection_model.fit(select_X_train, train_xgb[y_train.columns[1:]])
        selection_model.save_model(selection_model_file)
    # eval model
    select_X_val = selection.transform(scaler.transform(val_xgb[X_val.columns[1:]]))
    predictions = selection_model.predict(select_X_val)
    mae = mean_absolute_error(y_val['tsd'], predictions)
    mape = mean_absolute_percentage_error(y_val['tsd'], predictions)*100
    rmse = np.sqrt(mean_squared_error(y_val['tsd'], predictions))
    print("Thresh=%.5f, n=%d, MAE: %.2f, MAPE: %.2f%%, RMSE: %.2f" % (thresh, select_X_train.shape[1], mae, mape, rmse))
Feature selection
Retraining model
Thresh=0.00000, n=33, MAE: 1718.80, MAPE: 6.01%, RMSE: 2147.09
Thresh=0.00000, n=33, MAE: 1718.80, MAPE: 6.01%, RMSE: 2147.09
Thresh=0.00000, n=33, MAE: 1718.80, MAPE: 6.01%, RMSE: 2147.09
Thresh=0.00053, n=30, MAE: 1718.80, MAPE: 6.01%, RMSE: 2147.09
Thresh=0.00112, n=29, MAE: 1713.54, MAPE: 5.99%, RMSE: 2134.43
Thresh=0.00120, n=28, MAE: 1716.30, MAPE: 6.00%, RMSE: 2137.62
Thresh=0.00143, n=27, MAE: 1677.97, MAPE: 5.86%, RMSE: 2089.65
Thresh=0.00156, n=26, MAE: 1722.73, MAPE: 6.02%, RMSE: 2141.26
Thresh=0.00165, n=25, MAE: 1678.72, MAPE: 5.89%, RMSE: 2086.31
Thresh=0.00243, n=24, MAE: 1676.84, MAPE: 5.85%, RMSE: 2096.71
Thresh=0.00256, n=23, MAE: 1706.15, MAPE: 5.97%, RMSE: 2125.37
Thresh=0.00294, n=22, MAE: 1620.24, MAPE: 5.66%, RMSE: 2030.68
Thresh=0.00337, n=21, MAE: 1640.32, MAPE: 5.73%, RMSE: 2066.48
Thresh=0.00350, n=20, MAE: 1690.26, MAPE: 5.90%, RMSE: 2113.86
Thresh=0.00389, n=19, MAE: 1679.54, MAPE: 5.86%, RMSE: 2109.93
Thresh=0.00572, n=18, MAE: 1722.80, MAPE: 6.01%, RMSE: 2166.05
Thresh=0.00588, n=17, MAE: 1705.73, MAPE: 5.96%, RMSE: 2170.09
Thresh=0.00608, n=16, MAE: 1672.67, MAPE: 5.84%, RMSE: 2114.26
Thresh=0.00707, n=15, MAE: 1712.55, MAPE: 5.98%, RMSE: 2150.71
Thresh=0.00816, n=14, MAE: 1724.51, MAPE: 6.02%, RMSE: 2165.35
Thresh=0.00986, n=13, MAE: 1730.37, MAPE: 6.05%, RMSE: 2183.54
Thresh=0.01484, n=12, MAE: 1737.90, MAPE: 6.08%, RMSE: 2200.22
Thresh=0.02215, n=11, MAE: 1746.89, MAPE: 6.13%, RMSE: 2209.54
Thresh=0.03067, n=10, MAE: 1706.30, MAPE: 5.98%, RMSE: 2155.33
Thresh=0.03797, n=9, MAE: 1740.24, MAPE: 6.11%, RMSE: 2228.04
Thresh=0.04493, n=8, MAE: 1742.20, MAPE: 6.13%, RMSE: 2196.71
Thresh=0.04535, n=7, MAE: 1790.80, MAPE: 6.30%, RMSE: 2263.86
Thresh=0.08226, n=6, MAE: 1830.29, MAPE: 6.44%, RMSE: 2295.78
Thresh=0.08327, n=5, MAE: 4314.65, MAPE: 14.65%, RMSE: 5213.82
Thresh=0.09236, n=4, MAE: 4781.58, MAPE: 16.25%, RMSE: 5549.52
Thresh=0.14227, n=3, MAE: 4270.03, MAPE: 15.07%, RMSE: 5302.25
Thresh=0.15468, n=2, MAE: 4282.35, MAPE: 15.13%, RMSE: 5339.39
Thresh=0.18032, n=1, MAE: 4831.83, MAPE: 18.19%, RMSE: 6149.94
In [ ]:
# Select features based on threshold
selected_threshold = 0.0029
selected_features = model_importances.loc[model_importances['importance'] >= selected_threshold]
selected_features.to_csv(model_save_dir + 'features_' + model_name + '_th_{:.5f}'.format(selected_threshold) + '.csv')
selected_features.index
Out[ ]:
Index(['date_dayofyear_cos', 'date_period', 'date_week_cos', 'date_weekday',
       'tempmax', 'date_year', 'date_isholiday', 'temp', 'date_daypart',
       'date_hour_sin', 'date_period_sin', 'date_dayofyear', 'date_week',
       'date_month_cos', 'date_month_sin', 'date_quarter_cos', 'date_hour_cos',
       'date_period_cos', 'humidity', 'date_quarter_sin', 'tempmin',
       'date_week_sin'],
      dtype='object')
In [ ]:
print('Loading model')
xgb_model.load_model(model_save_dir + 'model_' + model_name + '_th_{:.5f}'.format(0.00294) + '.model')
print('Testing model')
# use the model to make a forecast
test_xgb = pd.merge(y_test, X_test, on='datetime', how='left')
scaler = scaler.fit(train_xgb[selected_features.index.to_list()]) # Refit the scaler on the selected features only
test_pred_xgb = xgb_model.predict(scaler.transform(test_xgb[selected_features.index.to_list()]))
plot_count = 1440

print('Evaluate model')
mse = mean_squared_error(y_test['tsd'], test_pred_xgb)
rmse = np.sqrt(mean_squared_error(y_test['tsd'], test_pred_xgb))
mae = mean_absolute_error(y_test['tsd'], test_pred_xgb)
mape = mean_absolute_percentage_error(y_test['tsd'], test_pred_xgb)*100
print("MSE: %.2f" % (mse))
print("RMSE: %.2f" % (rmse))
print("MAE: %.2f" % (mae))
print("MAPE: %.2f%%" % (mape))
Loading model
Testing model
Evaluate model
MSE: 94463722.92
RMSE: 9719.24
MAE: 8104.76
MAPE: 30.04%

Train GMLP¶

In [ ]:
# Define the model
gmlp_model = Sequential()
gmlp_model.add(Dense(32, activation='relu', kernel_initializer='random_uniform', input_shape=(len(selected_features.index.to_list()),)))
gmlp_model.add(Dense(IndependentNormal.params_size(1), activation='softplus', kernel_initializer='random_uniform'))
gmlp_model.add(IndependentNormal(event_shape=1))
opt = RMSprop(learning_rate=0.1)
gmlp_model.compile(optimizer=opt, 
        loss=negloglik,
        metrics=[MeanAbsolutePercentageError(), MeanAbsoluteError()])
In [ ]:
model_name = 'gmlp'
model_file = model_save_dir + 'model_' + model_name + '.h5'
if not os.path.exists(model_file):
    print('Training model')
    train_gmlp = pd.merge(y_train, X_train, on='datetime', how='left')
    scaler = StandardScaler()
    es = EarlyStopping(monitor='val_loss', patience=50)
    gmlp_model.fit(scaler.fit_transform(train_gmlp[selected_features.index.to_list()]), train_gmlp[y_train.columns[1:]], 
                   batch_size=100, epochs=30000, validation_split=0.1, callbacks=es)
    print('Saving model')
    gmlp_model.save_weights(model_file)
else:
    print('Loading model')
    gmlp_model.load_weights(model_file)
Loading model
In [ ]:
def plot_distribution(index: ndarray, samples: ndarray, mode: ndarray, lower_bound: ndarray, upper_bound: ndarray,
                      mean: ndarray, median: ndarray = None) -> None:
    """

    :rtype: object
    :param index:
    :param samples:
    :param mean:
    :param lower_bound:
    :param upper_bound:
    :param mode:
    :param median:
    """
    plt.figure(figsize=(15, 10))
    plt.scatter(index, samples, label='True value', alpha=.5, marker=".")
    plt.plot(index, mode, label='Mode value', color='r')
    plt.plot(index, mean, label='Mean value', color='m')
    if median is not None:
        plt.scatter(index, median, label='Median value', c='y', marker=".")
    plt.fill_between(np.squeeze(index), np.squeeze(lower_bound), np.squeeze(upper_bound), alpha=.3,
                     label='$CI$')
    plt.xlabel('Timestamp')
    plt.ylabel('Electricity demand (MW)')
    plt.title('Samples & fitted distribution\n+ 95% CIs')
    plt.legend()
    plt.show()
In [ ]:
print('Testing model')
# use the model to make a forecast
val_gmlp = pd.merge(y_val, X_val, on='datetime', how='left')
val_pred_gmlp = gmlp_model(scaler.transform(val_gmlp[selected_features.index.to_list()]))
plot_count = 1440*12
# Get predictions - E[Y|X]
y_hat = val_pred_gmlp.mean()
# Get SD
y_sd = val_pred_gmlp.stddev()

# Compute conf ints
y_hat_lower = y_hat - 2 * y_sd
y_hat_upper = y_hat + 2 * y_sd

# Plot data, predicted line and ~95% conf ints
plot_distribution(val_gmlp['datetime'].iloc[:plot_count], 
                  val_gmlp['tsd'].iloc[:plot_count], 
                  y_hat[:plot_count], 
                  y_hat_lower[:plot_count], 
                  y_hat_upper[:plot_count], 
                  y_hat[:plot_count])

print('Evaluate model')
mse = mean_squared_error(val_gmlp['tsd'], y_hat)
rmse = np.sqrt(mean_squared_error(val_gmlp['tsd'], y_hat))
mae = mean_absolute_error(val_gmlp['tsd'], y_hat)
mape = mean_absolute_percentage_error(val_gmlp['tsd'], y_hat)*100
print("MSE: %.2f" % (mse))
print("RMSE: %.2f" % (rmse))
print("MAE: %.2f" % (mae))
print("MAPE: %.2f%%" % (mape))
Testing model
Evaluate model
MSE: 11571126.39
RMSE: 3401.64
MAE: 2772.52
MAPE: 9.89%

The error exhibits a higher magnitude during the summer season.
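The mean ± 2·SD band used for the confidence intervals corresponds to roughly 95% coverage under a Gaussian predictive distribution. A quick numerical check of that rule:

```python
import numpy as np

rng = np.random.default_rng(42)
samples = rng.normal(loc=0.0, scale=1.0, size=100_000)
# Fraction of draws inside mean ± 2 standard deviations (~0.954 for a Normal)
coverage = np.mean(np.abs(samples) <= 2.0)
```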

In [ ]:
print('Testing model')
# use the model to make a forecast
test_gmlp = pd.merge(y_test, X_test, on='datetime', how='left')
test_pred_gmlp = gmlp_model(scaler.transform(test_gmlp[selected_features.index.to_list()]))
plot_count = 1440
# Get predictions - E[Y|X]
y_hat = test_pred_gmlp.mean()
# Get SD
y_sd = test_pred_gmlp.stddev()

# Compute conf ints
y_hat_lower = y_hat - 2 * y_sd
y_hat_upper = y_hat + 2 * y_sd

# Plot data, predicted line and ~95% conf ints
plot_distribution(test_gmlp['datetime'].iloc[:plot_count], 
                  test_gmlp['tsd'].iloc[:plot_count], 
                  y_hat[:plot_count], 
                  y_hat_lower[:plot_count], 
                  y_hat_upper[:plot_count], 
                  y_hat[:plot_count])

print('Evaluate model')
mae = mean_absolute_error(test_gmlp['tsd'], y_hat)
mape = mean_absolute_percentage_error(test_gmlp['tsd'], y_hat)*100
mse = mean_squared_error(test_gmlp['tsd'], y_hat)
rmse = np.sqrt(mean_squared_error(test_gmlp['tsd'], y_hat))
print("MSE: %.2f" % (mse))
print("RMSE: %.2f" % (rmse))
print("MAE: %.2f" % (mae))
print("MAPE: %.2f%%" % (mape))

# Residual analysis
y_res = y_hat - np.array(test_gmlp['tsd']).reshape(-1,1)
fig, axes = plt.subplots(4, 1, figsize=(15, 15))
axes[0].scatter(test_gmlp['date_period'], y_res)
axes[1].scatter(test_gmlp['date_month'], y_res)
axes[2].scatter(test_gmlp['date_dayofyear'], y_res)
axes[3].scatter(test_gmlp['date_weekday'], y_res)
fig, axes = plt.subplots(2, 2, figsize=(15, 15))
axes[0,0].scatter(test_gmlp['datetime'], y_res)
axes[0,1].scatter(y_hat, y_res)
sns.kdeplot(y_res, ax=axes[1,0])
axes[1,1].scatter(y_hat, test_gmlp['tsd'])
plt.show()
Testing model
Evaluate model
MSE: 3955876.87
RMSE: 1988.94
MAE: 1586.76
MAPE: 5.34%

Models evaluation¶

In this step, the robustness of the GMLP model is evaluated using a cross-validation procedure.

In [ ]:
X_cv = df_features.copy()
y_cv = df_demand.copy()
metrics = ['MSE', 'RMSE', 'MAE', 'MAPE']
df_cv = pd.DataFrame(columns=metrics, index=range(5))
# Perform cross-validation
start_year = test_split.year - 5
for fold in range(5):
    # Split the data into training and validation sets
    fold_date = datetime(start_year + fold, 1, 1)
    print('--------------------')
    print('Fold ' + str(fold))
    print('Training from ' + X_cv['datetime'].min().strftime('%Y/%m/%d') + ' to ' + fold_date.strftime('%Y/%m/%d'))
    print('Testing from  ' + fold_date.strftime('%Y/%m/%d') + ' to ' + test_split.strftime('%Y/%m/%d'))
    print('--------------------')
    X_train_fold = X_cv.loc[X_cv['datetime'] < fold_date]
    y_train_fold = y_cv.loc[y_cv['datetime'] < fold_date]
    X_val_fold = X_cv.loc[(X_cv['datetime'] >= fold_date) & (X_cv['datetime'] < test_split)]
    y_val_fold = y_cv.loc[(y_cv['datetime'] >= fold_date) & (y_cv['datetime'] < test_split)]
    # Set model filename
    model_name = 'gmlp'
    model_file = model_save_dir + 'model_' + model_name + '_cv_' + str(fold) + '.h5'
    # Define model
    gmlp_model = Sequential()
    gmlp_model.add(Dense(32, activation='relu', kernel_initializer='random_uniform', input_shape=(len(selected_features.index.to_list()),)))
    gmlp_model.add(Dense(IndependentNormal.params_size(1), activation='softplus', kernel_initializer='random_uniform'))
    gmlp_model.add(IndependentNormal(event_shape=1))
    # Compile and train the model
    opt = RMSprop(learning_rate=0.1)
    gmlp_model.compile(optimizer=opt, 
            loss=negloglik,
            metrics=[MeanAbsoluteError()])
    if not os.path.exists(model_file):
        print('Training model')
        train_gmlp = pd.merge(y_train_fold, X_train_fold, on='datetime', how='left')
        scaler = StandardScaler()
        es = EarlyStopping(monitor='val_loss', patience=50)
        gmlp_model.fit(scaler.fit_transform(train_gmlp[selected_features.index.to_list()]), train_gmlp[y_train.columns[1:]], 
                    batch_size=100, epochs=300, validation_split=0.2, callbacks=es)
        print('Saving model')
        gmlp_model.save_weights(model_file)
    else:
        print('Loading model')
        gmlp_model.load_weights(model_file)
    # Evaluate the model on the validation set
    val_fold_gmlp = pd.merge(y_val_fold, X_val_fold, on='datetime', how='left')    
    print('Testing model')
    # use the model to make a forecast
    val_pred_gmlp = gmlp_model(scaler.transform(val_fold_gmlp[selected_features.index.to_list()]))
    plot_count = 1440
    # Get predictions - E[Y|X]
    y_hat = val_pred_gmlp.mean()
    # Get SD
    y_sd = val_pred_gmlp.stddev()

    # Compute conf ints
    y_hat_lower = y_hat - 2 * y_sd
    y_hat_upper = y_hat + 2 * y_sd

    print('Evaluate model')
    mae = mean_absolute_error(val_fold_gmlp['tsd'], y_hat)
    mape = mean_absolute_percentage_error(val_fold_gmlp['tsd'], y_hat)*100
    mse = mean_squared_error(val_fold_gmlp['tsd'], y_hat)
    rmse = np.sqrt(mse)
    df_cv.iloc[fold, :] = np.array([mse, rmse, mae, mape])
    print("MSE: %.2f" % (mse))
    print("RMSE: %.2f" % (rmse))
    print("MAE: %.2f" % (mae))
    print("MAPE: %.2f%%" % (mape))
    
--------------------
Fold 0
Training from 2009/01/01 to 2018/01/01
Testing from  2018/01/01 to 2023/01/01
--------------------
Loading model
Testing model
Evaluate model
MSE: 12909420.71
RMSE: 3592.97
MAE: 2929.37
MAPE: 10.07%
--------------------
Fold 1
Training from 2009/01/01 to 2019/01/01
Testing from  2019/01/01 to 2023/01/01
--------------------
Loading model
Testing model
Evaluate model
MSE: 14338705.79
RMSE: 3786.65
MAE: 3150.55
MAPE: 11.10%
--------------------
Fold 2
Training from 2009/01/01 to 2020/01/01
Testing from  2020/01/01 to 2023/01/01
--------------------
Loading model
Testing model
Evaluate model
MSE: 7534314.98
RMSE: 2744.87
MAE: 2137.82
MAPE: 7.63%
--------------------
Fold 3
Training from 2009/01/01 to 2021/01/01
Testing from  2021/01/01 to 2023/01/01
--------------------
Loading model
Testing model
Evaluate model
MSE: 5860005.86
RMSE: 2420.74
MAE: 1897.86
MAPE: 6.68%
--------------------
Fold 4
Training from 2009/01/01 to 2022/01/01
Testing from  2022/01/01 to 2023/01/01
--------------------
Loading model
Testing model
Evaluate model
MSE: 9651613.76
RMSE: 3106.70
MAE: 2495.94
MAPE: 8.90%
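The per-fold metrics collected in `df_cv` can be summarized to judge stability across folds. Using the MAPE values printed above for the five folds:

```python
import pandas as pd

# Per-fold MAPE values reported by the cross-validation loop above
mape_folds = pd.Series([10.07, 11.10, 7.63, 6.68, 8.90],
                       index=[f'fold {i}' for i in range(5)], name='MAPE')
print(mape_folds.describe()[['mean', 'std']])
```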

Forecast visualization¶

In this section, plots are created to visualize the predictions of the GMLP model over the year 2023.

In [ ]:
# use the model to make a forecast
test_gmlp = pd.merge(y_test, X_test, on='datetime', how='left')
test_pred_gmlp = gmlp_model(scaler.transform(test_gmlp[selected_features.index.to_list()]))
# Get predictions - E[Y|X]
y_hat = test_pred_gmlp.mean()
# Get SD
y_sd = test_pred_gmlp.stddev()

# Compute conf ints
y_hat_lower = y_hat - 2 * y_sd
y_hat_upper = y_hat + 2 * y_sd

# Plot data, predicted line and ~95% conf ints
plot_distribution(test_gmlp['datetime'], 
                  test_gmlp['tsd'], 
                  y_hat, 
                  y_hat_lower, 
                  y_hat_upper, 
                  y_hat)
plot_count = 48*(7 + 1) # Plot first week and first day of year
# Plot data, predicted line and ~95% conf ints
plot_distribution(test_gmlp['datetime'].iloc[:plot_count], 
                  test_gmlp['tsd'].iloc[:plot_count], 
                  y_hat[:plot_count], 
                  y_hat_lower[:plot_count], 
                  y_hat_upper[:plot_count], 
                  y_hat[:plot_count])

Conclusions¶

In conclusion, this notebook presents the results and findings of the UK electricity demand forecasting project. Through comprehensive analysis and experimentation, several key insights have been obtained:

  • The project confirmed that accounting for seasonal features and weather information is crucial to minimize forecasting errors in electricity demand. Models that considered these factors showed improved performance compared to purely seasonal models.
  • Despite its popularity in certain applications, the Prophet model was found to have limitations for electricity demand forecasting. The model's training time was lengthy, and its performance did not match that of other models, making it less suitable for this specific task.
  • The XGB model demonstrated its capability to reduce feature dimensionality, simplifying the model and reducing complexity. This feature-selection process helped improve the performance of the other forecasting models, including the neural network.
  • The GMLP model showed impressive generalization capacity by effectively predicting electricity demand for the years 2022 and 2023 using data from 2009 to 2021. This ability to generalize beyond the training years highlights the model's robustness.

After thorough evaluation, the GMLP model emerged as the best performer among the considered models. It exhibited the lowest values for key evaluation metrics, produced normally distributed residuals, and efficiently handled uncertainty. These characteristics solidify the GMLP model as the preferred choice for accurate electricity demand forecasting.